
[BUG] Fix source dataset issue when running link jobs #1193

Merged (3 commits) on Apr 20, 2023

Conversation

ThomasHepworth (Contributor)

Quick Summary

It looks as if there was a bug in vertically_concatenate.py that was causing two source_dataset columns to be produced in various steps when source_dataset was specified in the settings object.

This was causing issues in DuckDB, where I couldn't use a vertically concatenated dataframe with a source_dataset column for a link job -- i.e. a single df with a source_dataset column indicating which dataset each record belongs to.

But, more troubling than this, it appears that this was outright breaking link only jobs in Spark. I'll post some code that breaks below.

Essentially though, it was causing this behaviour in concat_with_tf, which was causing spark to throw a wobbly.

[Screenshot (2023-04-18 22:25): the resulting behaviour in concat_with_tf]
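A minimal sketch of the underlying problem (using sqlite3 purely as a stand-in for DuckDB/Spark; this is not splink's generated SQL, and the table/column values are hypothetical): stamping a dataset label as source_dataset while also selecting `*` from an input that already carries a source_dataset column produces two result columns with the same name, which stricter engines reject.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE input_df (unique_id INTEGER, source_dataset TEXT)")
con.execute("INSERT INTO input_df VALUES (1, 'left'), (2, 'right')")

# Stamping a dataset label alongside * duplicates the column name:
# the alias 'source_dataset' collides with the user's existing column.
cur = con.execute("SELECT 'input_df' AS source_dataset, * FROM input_df")
cols = [d[0] for d in cur.description]
print(cols)  # ['source_dataset', 'unique_id', 'source_dataset']
```

sqlite3 tolerates the duplicate in a result set, but DuckDB and Spark do not, which is the failure mode described above.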

This code fixes the issue by migrating _source_dataset_column_name to the linker class and checking whether the column already exists within the user's database.


Internals and why I opted to go down this path:

To start - these changes don't adjust the underlying SQL/logic that's being used wherever I've replaced linker._settings_obj_source_dataset_column_name with linker._source_dataset_column_name.

The workflow is still:

  1. Use the alias table name provided by the user to create a new column in concat_with_tf
  2. Use this to filter where required
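The two steps above can be sketched in runnable form (sqlite3 standing in for the actual backend; table names df_l/df_r and the cross-dataset filter are illustrative assumptions, not splink's generated queries):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df_l (unique_id INTEGER, first_name TEXT)")
con.execute("CREATE TABLE df_r (unique_id INTEGER, first_name TEXT)")
con.execute("INSERT INTO df_l VALUES (1, 'amy'), (2, 'bob')")
con.execute("INSERT INTO df_r VALUES (1, 'amy'), (3, 'cat')")

# Step 1: vertically concatenate, stamping each row with its table alias
con.execute("""
CREATE TABLE concat_with_tf AS
SELECT 'df_l' AS source_dataset, * FROM df_l
UNION ALL
SELECT 'df_r' AS source_dataset, * FROM df_r
""")

# Step 2: use the new column to filter, e.g. keep only cross-dataset pairs
pairs = con.execute("""
SELECT l.unique_id, r.unique_id
FROM concat_with_tf l JOIN concat_with_tf r
  ON l.first_name = r.first_name
WHERE l.source_dataset < r.source_dataset
""").fetchall()
print(pairs)  # [(1, 1)]
```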

The change is in the naming convention used and what we output in predict().

Now, if the user provides a dataframe that already contains source_dataset, splink will adjust step 1 to use the alias __splink_source_dataset. This alias is then dropped when it's no longer needed, and the user is left with their original source_dataset in the output.

If the user provides a df without source_dataset, splink won't fall over and will simply name the column created in step 1 source_dataset.
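The naming rule across the two cases can be sketched as a tiny helper (a sketch of the behaviour described above, not the actual splink internals; the function name is hypothetical):

```python
def source_dataset_column_name(input_columns):
    """Pick the name splink uses for the column it creates in step 1.

    If the user's data already has source_dataset, fall back to the
    internal alias so the two never collide.
    """
    if "source_dataset" in input_columns:
        return "__splink_source_dataset"
    return "source_dataset"

# No user-supplied column: splink just uses source_dataset
print(source_dataset_column_name(["unique_id", "first_name"]))

# User-supplied column present: splink switches to the internal alias
print(source_dataset_column_name(["unique_id", "source_dataset"]))
```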

Why this way?

  1. It doesn't fall over in the no-source_dataset case
  2. It means we don't need any over-the-top adjustments to the splink internals
  3. It means that even if the user hasn't supplied a truly unique source_dataset column, the link job will still run.

On point 3 - we may want to check the source dataset column in preprocessing and establish whether it's valid.
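Such a preprocessing check (suggested by point 3 but not implemented in this PR; the function name and threshold are hypothetical) might look like:

```python
def validate_source_dataset(values, expected_n_datasets=2):
    """Verify the source_dataset column distinguishes enough datasets.

    A link job comparing records across datasets needs at least two
    distinct dataset labels to be meaningful.
    """
    distinct = set(values)
    if len(distinct) < expected_n_datasets:
        raise ValueError(
            f"source_dataset has {len(distinct)} distinct value(s); "
            f"a link job needs at least {expected_n_datasets}"
        )
    return distinct

print(sorted(validate_source_dataset(["left", "right", "left"])))
# ['left', 'right']
```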

@github-actions (Contributor)

Test: test_2_rounds_1k_duckdb

Percentage change: -29.9%

|      | date time           | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw                    | machine_info_cpu_hz_actual_friendly | commit_hash |
|------|---------------------|------------|-----------|--------------------|----------------|-----------------------------------------------|-------------------------------------|-------------|
| 849  | 2022-07-12 18:40:05 | 1.89098    | 1.87463   | splink3            | c334bb9        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz                          | c334bb9     |
| 1575 | 2023-04-18 21:46:00 | 1.33526    | 1.31404   | (detached head)    | c121eb9        | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 2.5939 GHz                          | c121eb9     |

Test: test_2_rounds_1k_sqlite

Percentage change: -24.1%

|      | date time           | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw                    | machine_info_cpu_hz_actual_friendly | commit_hash |
|------|---------------------|------------|-----------|--------------------|----------------|-----------------------------------------------|-------------------------------------|-------------|
| 851  | 2022-07-12 18:40:05 | 4.32179    | 4.25898   | splink3            | c334bb9        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz                          | c334bb9     |
| 1577 | 2023-04-18 21:46:00 | 3.23778    | 3.23334   | (detached head)    | c121eb9        | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 2.5939 GHz                          | c121eb9     |


@ADBond (Contributor) left a comment:

Looks like a sensible change to me - just one comment re: tests

tests/test_full_example_duckdb.py (resolved):

```python
    ),
  ],
)
def test_link_only(input, source_l, source_r):
```

Nice test! Really helps understand the change
